Algorithms for postprocessing OCR results with visual inter-word constraints

نویسندگان

  • Tao Hong
  • Jonathan J. Hull
چکیده

Algorithms are presented that determine the visual relationships between word images in a document. These include instances of common word images and common substrings that occur often in English language text images. This information is then be used to improve the performance of a commercial optical character recognition (OCR) algorithm. The algorithms presented here calculate clusters of equivalent word images as well as common initial and final substrings. Experimental results are presented that show a 40% reduction in word level error rate is achieved on a test set of documents degraded by uniform noise.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Visual inter-word relations and their use in OCR postprocessing

A technique is presented that uses visual relationships between word images in a document to improve the recognition of the text it contains. This technique takes advantage of the visual relationships between word images that are usually lost in most conventional optical character recognition (OCR) techniques. The visual relations are defined to be the equivalence that exists between images of ...

متن کامل

The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model

The techniques of image processing have been used in optical character recognition (OCR) for a long time. The recognition method evolved from early "pattern recognition" to "feature extraction" recently. The recognition rate is raised from 70% to 90%. But the character by character recognition technique has its limitation. Using language models to assist the OCR system in improving recognition ...

متن کامل

Multifont OCR Postprocessing System

A series of techniques is being developed to postprocess noisy, multifont, nonformatted OCR data on a word basis to 1 ) determine if a field is alphabetic or numeric; 2) verify that an alphabetic word is legitimate; 3 ) fetch from a dictionary a set of potential entries using a garbled word as a key; and 4) error-correct the garbled word by selecting the most likely dictionary word. Four algori...

متن کامل

Improving OCR Performance in Biomedical Literature Retrieval through Preprocessing and Postprocessing

Today’s information retrieval (IR) techniques are mostly text-based. As a consequence, some types of information are beyond the reach of text-based IR systems, which fail in situations where textual information can not be easily accessed, e.g. textual information in biomedical images and figures. To tackle such situations, we propose to augment IR systems with the ability to perform optical cha...

متن کامل

A Statistical Approach to Automatic OCR Error Correction in Context

This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexico...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1995